Extremal Distances for Subtree Transfer Operations
in Binary Trees

Ross Atkins and Colin McDiarmid 


Abstract. Three standard subtree transfer operations for binary trees, used in particular for phylogenetic
trees, are: tree bisection and reconnection (TBR), subtree prune and regraft (SPR) and
rooted subtree prune and regraft (rSPR). For a pair of leaf-labelled binary trees with $n$ leaves, the
maximum number of such moves required to transform one into the other is $n - \Theta(\sqrt{n})$, extending a
result of Ding, Grunewald and Humphries. We show that if the pair is chosen uniformly at random,
then the expected number of moves required to transform one into the other is $n - \Theta(n^{2/3})$. These
results may be phrased in terms of agreement forests: we also give extensions for more than two
binary trees.


keywords: Phylogenetic Tree, Subtree Prune and Regraft, Tree Bisection and Reconnection, 
Binary Tree, Agreement Forest, Tree Rearrangement 


1. Introduction 

A standard way to transform one binary tree into another is by performing ‘subtree-prune-and-regraft’
(SPR) moves (definitions are given below). These operations are of particular interest for phylogenetic
trees. For a pair of binary trees with $n + 1$ labelled leaves (including the root), it was shown by Allen
and Steele [2] in 2001 that the maximum number of SPR moves required to change one into the other,
$D_{SPR}(n)$, is between $\frac{n}{2} - o(n)$ and $n - 2$. Martin and Thatte [10] show the existence of a common
subtree of size $\Omega(\sqrt{\log n})$, which brings the upper bound for $D_{SPR}(n)$ down to $n - \Omega(\sqrt{\log n})$.

Ding, Grunewald and Humphries [4] show that $D_{SPR}(n)$ is actually $n - \Theta(\sqrt{n})$. We give an extended
version of this result for the rooted and unrooted cases. We also show that if one of the trees is fixed
arbitrarily and the other is chosen to maximise the number of SPR moves required to turn one into
the other, then still $n - \Theta(\sqrt{n})$ moves are required. We prove the last result by showing that for a
fixed pair of binary trees, we can label the leaves in such a way that $n - \Theta(\sqrt{n})$ moves are required
to turn one into the other. This result is set in a more general context of agreement forests for sets
of $k \ge 2$ trees. We also show that if two trees are chosen independently, uniformly at random, then the
expected number of moves required to transform one into the other is $n - \Theta(n^{2/3})$ (Theorem 1.11).

Subtree transfer operations are subtly different when acting on rooted trees as opposed to unrooted
trees. Some papers have considered SPR and rSPR moves as acting on different classes of
binary tree. We give a unified treatment here; when dealing with rooted-subtree-prune-and-regraft
(rSPR) moves, we insist that one of the leaf-labels be 0, and this leaf acts effectively as the root of
the tree. We insist that the root has degree 1 and it behaves in exactly the same manner as the other
leaves for SPR and TBR moves.¹


¹Sometimes, the root of a rooted binary tree is defined to have degree 2. This is equivalent to our definition (see
Definition 1.1). Simply add a new leaf, adjacent to the degree-2 root, label it 0 and make it the new root.

Figure 2. The TBR operation.


In the subsections below we define binary trees, the three subtree transfer operations TBR, SPR 
and rSPR, and the corresponding metrics they induce over the class of binary trees. 

1.1. Binary Trees 

Definition 1.1. Let $X$ be a non-empty, finite set. If $|X| > 1$, then a binary leaf-labelled tree with
leaf-label set $X$ is a finite tree which has 2 types of vertex:

• tree vertices of degree 3, and

• leaf vertices of degree 1, labelled with the set $X$ (there is a bijection between the leaves and $X$).

If $0 \in X$, the leaf labelled 0 is called the root of the tree. In the trivial case, when $X = \{a\}$ is a
singleton, then the trivial graph containing only one vertex, labelled $a$, is considered to be a binary
leaf-labelled tree, and its vertex is called a leaf.

Let $\mathcal{B}(X)$ be the set of all binary leaf-labelled trees with label set $X$. For a leaf-labelled tree $A$,
let $L(A)$ denote the label set of $A$; so if $A \in \mathcal{B}(X)$ then $L(A) = X$. For trees $T$ and $T'$ in $\mathcal{B}(X)$, we
say $T$ is isomorphic to $T'$ (denoted $T = T'$) if there is a graph isomorphism preserving leaf labels; that is, if
there is a bijection $\psi : V(T) \to V(T')$ such that $uv$ is an edge in $T$ if and only if $\psi(u)\psi(v)$ is an edge
in $T'$, and for each leaf $v$ of $T$, $v$ and $\psi(v)$ are labelled with the same element of $X$. For non-empty
$S \subseteq L(A)$, let $A|S$ be the minimal subtree of $A$ that contains all the leaves with labels in $S$. Let $A/S$
be the tree formed by suppressing any vertices of degree 2 from $A|S$. So

$A|S$ is a subgraph of $A$ and $A/S \in \mathcal{B}(S)$.

In this paper, $X$ will usually be $\{0, 1, 2, \ldots, n\}$, i.e. $|X| = n + 1$. So $|\mathcal{B}(X)|$ is $1, 1, 3, 15, \ldots$ for
$n = 1, 2, 3, 4, \ldots$ and in general for $n \ge 2$, the number is given by

$|\mathcal{B}(X)| = 1 \times 3 \times 5 \times \cdots \times (2n - 3).$
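
As a quick check of the counting formula above, the product of odd numbers can be evaluated directly. This is only an illustrative sketch; the function name `count_binary_trees` is our own and does not come from the paper.

```python
def count_binary_trees(n):
    """Number of binary leaf-labelled trees with label set {0, 1, ..., n}, n >= 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n - 3 (an empty product for n <= 2).
    for odd in range(3, 2 * n - 2, 2):
        count *= odd
    return count


if __name__ == "__main__":
    print([count_binary_trees(n) for n in range(1, 6)])
```

For $n = 1, 2, 3, 4$ this reproduces the values $1, 1, 3, 15$ quoted in the text.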

1.2. Tree Bisection and Reconnection 

Figure 3. The SPR move

Definition 1.2. A tree-bisection-and-reconnection (TBR) on a tree $T \in \mathcal{B}(X)$, for $|X| \ge 3$, is a two
step process:


1. (bisection step) Delete an edge $e$ from $T$ (this edge is called the bisection edge) and then suppress
all degree 2 vertices (the edges created by these suppressions are called new edges) to obtain a
pair $T_1, T_2$ of binary trees with $L(T_1) \cup L(T_2) = X$. (The number of new edges is 2, unless the
bisection edge was incident with a leaf. In the latter case, there is only one new edge and either
$T_1$ or $T_2$ is an isolated vertex.)

2. (reconnection step) Connect $T_1$ and $T_2$ by creating an edge between the midpoint of an edge in
$T_1$ and the midpoint of an edge in $T_2$. This edge is called the reconnecting edge. (In the case that
$T_1$ or $T_2$ is an isolated vertex, instead the reconnecting edge is incident to this vertex.)

See Figure 2: on the left the bisection edge is blue and the green edges will form the new edges,
and on the right the blue edge is the reconnecting edge and the green edges are the new edges.

If $A \in \mathcal{B}(X)$, and $B$ is the result of performing a TBR on $A$, then $B \in \mathcal{B}(X)$ because $B$ must be
connected, acyclic, and all non-leaf vertices have degree 3.

Definition 1.3. Let $\mathcal{TBR} = \mathcal{TBR}(X)$ denote the set of pairs $(A, B) \in \mathcal{B}(X)^2$ such that $B$ can be
obtained from $A$ by performing a single TBR move.

Proposition 1.4. $(A, B) \in \mathcal{TBR}$ if and only if $(B, A) \in \mathcal{TBR}$.

Proof. If $f$ is a TBR which transforms $A$ into $B$, then we can find a TBR, $f^{-1}$, which transforms
$B$ into $A$ using the following construction. Firstly use the reconnecting edge of $f$ as the bisection
edge of $f^{-1}$. The two components formed will be identical to the two components formed when the
bisection step of $f$ was performed. If $f$ had two new edges, then we can let the reconnecting edge of
$f^{-1}$ be an edge between the midpoints of these edges. If $f$ had one new edge then one part, say $T_2$,
must be an isolated vertex $v$, and so we can let the reconnecting edge of $f^{-1}$ be an edge between $v$
and the midpoint of the edge in $T_1$ which was the new edge of $f$. □

1.3. Subtree Prune and Regraft 

The SPR move (Definition 1.5) is an operation on binary trees whereby a subtree is removed from
one part of the tree and regrafted to another part of the tree. SPR moves feature in many papers, see 
for example: [2] and [3]. These SPR moves are widely used by tree-searching software packages, like 
PAUP [12] and Garli [13], used in phylogenetic research. 

Definition 1.5. A subtree-prune-and-regraft (SPR) move taking $xy$ from $ab$ to $cd$ is an operation which
can be performed on a tree $T \in \mathcal{B}(X)$, as long as $xy, ax, xb, cd$ are distinct edges (vertices $a, b, c, d$
need not be distinct; possibly $a = c$, $b = c$, $a = d$ or $b = d$) and the path from $c$ to $y$ contains $x$. A
SPR move is a two step process:

• (prune step) delete edges ax, xb and replace them with ab, then 

• (regraft step) delete edge cd and replace it with cx and xd. 

Let $\mathcal{SPR} = \mathcal{SPR}(X)$ denote the set of pairs $(A, B)$ such that $A \in \mathcal{B}(X)$ and $B$ is the result of a SPR
move on $A$. If $0 \in X$ and the path from 0 to $y$ passes through $x$, then this is called a rooted-subtree-prune-and-regraft
(rSPR) move. Let $r\mathcal{SPR}$ denote the set of pairs $(A, B)$ such that $B$ is the result of
an rSPR move on $A$.

See Figure 3. By definition, an rSPR move is a special case of a SPR move. It is not difficult to
check that a SPR move is a special case of a TBR move. Informally, a SPR move is any TBR move
in which one of $T_1$ or $T_2$ is an isolated vertex or one of the endpoints of the reconnecting edge is the
midpoint of one of the new edges. Therefore any valid SPR move performed on a binary tree in $\mathcal{B}(X)$
will result in a tree in $\mathcal{B}(X)$, and moreover:

$r\mathcal{SPR} \subseteq \mathcal{SPR} \subseteq \mathcal{TBR}$.

Figure 4. Trees that differ by a single SPR, but at least $\frac{n-3}{2}$ rSPR moves.

If we think of the edges as being oriented away from the root, then an rSPR move is any SPR move 
that preserves these orientations. 

Proposition 1.6. If $(A, B) \in \mathcal{SPR}$ then $(B, A) \in \mathcal{SPR}$. Moreover, if $(A, B) \in r\mathcal{SPR}$ then $(B, A) \in r\mathcal{SPR}$.

Proof. Suppose that we start with a tree $A$ and perform the SPR move (or similarly the rSPR move)
taking $xy$ from $ab$ to $cd$, and thus obtain $B$. Then starting at $B$, we can perform the ‘inverse’ SPR
move (or rSPR move) taking $xy$ from $cd$ to $ab$, and thus we obtain $A$. □

Not all SPR moves can be achieved with a single rSPR move. Indeed, we may need an arbitrary
number of rSPR moves to simulate one SPR move. For example the trees in Figure 4 differ by a single
SPR, obtained by pruning off the single leaf labelled 0 and then regrafting it appropriately. In the Example in Section 2 we
will show that at least $\frac{n-3}{2}$ rSPR moves are required to transform one into the other.

The following proposition is Lemma 2.7 (2b) in [5]. 

Proposition 1.7. Any TBR move can be achieved using at most 2 SPRs; if $(A, B) \in \mathcal{TBR}$, then there
exists some $C$ such that $(A, C), (C, B) \in \mathcal{SPR}$.

1.4. Tree Metrics 

The SPR-distance (Definition 1.8) has been the subject of several publications, for example [2], [B], [7].
The problem of determining the rSPR-distance between a pair of trees was shown to be NP-hard by
Bordewich and Semple [3] in 2005. We shall see later that all of the values in the following definition
are finite.

Definition 1.8. For $A, B \in \mathcal{B}(X)$ and any subtree transfer operation $\chi \in \{TBR, SPR, rSPR\}$, define
the $\chi$-distance between $A$ and $B$ (denoted $d_\chi(A, B)$) to be the minimum number of $\chi$ moves required
to change $A$ into $B$. When discussing rSPR moves, we assume $0 \in X$. By Propositions 1.4 and 1.6
this distance is symmetric. For $n \ge 1$ and $X = [n] \cup \{0\}$, let $D_\chi(n)$ denote the $\chi$-diameter of the class
$\mathcal{B}(X)$;

$$D_\chi(n) := \max_{A, B \in \mathcal{B}(X)} d_\chi(A, B),$$

and moreover, let $R_\chi(n)$ denote the $\chi$-radius of the class $\mathcal{B}(X)$;

$$R_\chi(n) := \min_{C \in \mathcal{B}(X)} \max_{B \in \mathcal{B}(X)} d_\chi(B, C).$$

Figure 5. Small values of the radius and diameter

The following is an immediate consequence of $r\mathcal{SPR} \subseteq \mathcal{SPR} \subseteq \mathcal{TBR}$.

Proposition 1.9. For any $\chi \in \{TBR, SPR, rSPR\}$, and any integer $n \ge 1$, we have $D_\chi(n) \ge R_\chi(n)$.
Moreover,

$D_{rSPR}(n) \ge D_{SPR}(n) \ge D_{TBR}(n)$ and $R_{rSPR}(n) \ge R_{SPR}(n) \ge R_{TBR}(n)$.
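
To make the diameter and radius concrete, the sketch below computes both for any small "move graph" given explicitly as an adjacency dictionary (vertices are trees, edges join trees that differ by one move). The representation and the names `distances_from` and `diameter_and_radius` are assumptions made for illustration; enumerating the actual moves on $\mathcal{B}(X)$ is not attempted here.

```python
from collections import deque


def distances_from(adjacency, source):
    """Breadth-first search: minimum number of moves from source to each tree."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter_and_radius(adjacency):
    """Return (diameter, radius) of a connected move graph."""
    eccentricity = {}
    for u in adjacency:
        dist = distances_from(adjacency, u)
        if len(dist) != len(adjacency):
            raise ValueError("move graph is not connected")
        eccentricity[u] = max(dist.values())
    return max(eccentricity.values()), min(eccentricity.values())


# Toy input: the three binary trees on leaves {0, 1, 2, 3}, written as splits.
# Any two of them differ by a single move, so the move graph is a triangle.
toy = {
    "01|23": ["02|13", "03|12"],
    "02|13": ["01|23", "03|12"],
    "03|12": ["01|23", "02|13"],
}

if __name__ == "__main__":
    print(diameter_and_radius(toy))
```

Since the radius is a minimum of the same eccentricities whose maximum is the diameter, the output always satisfies $D \ge R$, as in Proposition 1.9.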

1.5. Main Theorems 

The values of the radius and diameter for $n < 6$ are given in Figure 5. The asymptotic values of the
radius and diameter are given in the following theorem.

Theorem 1.10. For each $\chi \in \{TBR, SPR, rSPR\}$, both $D_\chi(n)$ and $R_\chi(n)$ are $n - \Theta(\sqrt{n})$.

By Proposition 1.9 it suffices to give an upper bound for the rSPR-diameter and a lower bound
for the TBR-radius, both of the form $n - \Omega(\sqrt{n})$. These bounds are given explicitly in Lemmas 2.5
and 4.5 respectively.

Theorem 1.11. If $A$ and $B$ are chosen uniformly at random from $\mathcal{B}([n] \cup \{0\})$, then

$\mathbb{E}[d_\chi(A, B)] = n - \Theta(n^{2/3})$

for any $\chi \in \{TBR, SPR, rSPR\}$.

Again by Proposition 1.9 it suffices to give an upper bound for the expected rSPR-distance and
a lower bound for the expected TBR-distance, both of the form $n - \Theta(n^{2/3})$. These bounds are given
explicitly in Lemmas 3.1 and 5.1 respectively.


2. Upper Bound for the rSPR Diameter 

The upper bound in Theorem 1.10 is proved in this section using rooted agreement forests. The
definition of a rooted agreement forest given here is equivalent to that used by Bordewich and Semple
[3].

Definition 2.1. Let $0 \in X$, and let $\mathcal{A} \subseteq \mathcal{B}(X)$ be a set of $k \ge 2$ trees. A rooted agreement forest of $\mathcal{A}$
is a partition $L_1, \ldots, L_m$ of $X$, such that for all $A, B \in \mathcal{A}$:

• the subtrees $A|L_1, A|L_2, \ldots, A|L_m$ are disjoint, and

• $A/(L_j \cup \{0\}) = B/(L_j \cup \{0\})$ for $j = 1, \ldots, m$.

Let $M_r(\mathcal{A})$ be the minimal possible value of $m$; the minimal number of parts in a rooted agreement
forest of $\mathcal{A}$. When $k = 2$, we write $M_r(A, B)$ for $M_r(\{A, B\})$.

When $n \ge 2$ and $X = \{0, 1, \ldots, n\}$, we can always construct a trivial rooted agreement forest with
$n - 1$ parts by setting $L_1 = \{0, 1, 2\}$ and $L_i = \{i + 1\}$ for $i = 2, \ldots, n - 1$. Therefore $M_r(A, B) \le n - 1$
for any $A, B \in \mathcal{B}(X)$. Three minimal agreement forests are depicted in Figure 6.
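
The trivial forest described above is easy to write down explicitly. The following sketch only builds the partition of labels (it does not check the agreement-forest conditions against any tree); the function name is hypothetical.

```python
def trivial_rooted_agreement_forest(n):
    """Partition of {0, ..., n} with L_1 = {0, 1, 2} and L_i = {i + 1} for i = 2..n-1."""
    if n < 2:
        raise ValueError("the construction needs n >= 2")
    return [{0, 1, 2}] + [{i + 1} for i in range(2, n)]


if __name__ == "__main__":
    print(trivial_rooted_agreement_forest(5))
```

The partition has $n - 1$ parts and covers every label exactly once, which is all that the bound $M_r(A, B) \le n - 1$ uses.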

The following result is proved by Bordewich and Semple [3], building on the work of Hein et al.
[8]. We give a proof here for completeness.

Lemma 2.2 (Rooted Agreement Forest Lemma). If $A$ and $B$ are any two trees in $\mathcal{B}(X)$, then

$M_r(A, B) = d_{rSPR}(A, B) + 1.$

Figure 6. Three minimal rooted agreement forests.


Proof. Let $A = A_0, A_1, \ldots, A_d = B$ be a shortest sequence with $(A_{i-1}, A_i) \in r\mathcal{SPR}$ for all $i = 1, 2, \ldots, d$. If
we perform all the prunes of these rSPR moves (and none of the regrafts), then the result will be a
partition of the leaves into a valid (rooted) agreement forest for $A$ and $B$. Therefore

$M_r(A_0, A_d) \le d + 1 = d_{rSPR}(A_0, A_d) + 1.$

Now we will show that $d_{rSPR}(A, B) + 1 \le M_r(A, B)$, by induction on $m := M_r(A, B)$. For the
base case $m = 1$, the only agreement forest with one component would be $A = B$ itself, hence
$d_{rSPR}(A, B) = 0$. For the inductive step $m \ge 2$, it suffices to show that there exists some $A' \in \mathcal{B}(X)$
such that $(A, A') \in r\mathcal{SPR}$ and $M_r(A', B) \le m - 1$. Let $F = (L_1, \ldots, L_m)$ be an agreement forest for
$(A, B)$ of $m \ge 2$ parts and wlog $0 \in L_1$. We can construct $A'$ from $F$ explicitly:

• Wlog $B|L_2$ is one of the subtrees of $B$ that is connected to $B|L_1$ by a path $p_B$, which does not
intersect $B|L_i$ for any $i > 2$.

• Let $e = yx \in E(A)$ be the first edge on the path $p_A$ from $A|L_2$ to $A|L_1$, where $y$ is in $A|L_2$ and
$x$ is not. If $x$ is a leaf (i.e. if $L_1 = \{0\}$ and $x = 0$ is the root) then we do not perform an rSPR
move. Otherwise $x$ is a tree vertex. Let $a$ and $b$ be the two neighbours of $x$ other than $y$.

• Now we can prune $A$ at $e$ (that is, take $xy$ from its current position on $ab$) and then regraft it
to form $A'$ such that

$A'/(L_1 \cup L_2) = B/(L_1 \cup L_2).$

In the case that $L_1 = \{0\}$ is a singleton, this involves regrafting at the only edge incident
with the root of $A$. Otherwise (if $|L_1| \ge 2$) this involves regrafting at an internal edge of $A|L_1$
corresponding to where the path $p_B$ joins $B|L_1$.

We know that $\{L_1 \cup L_2, L_3, L_4, \ldots, L_m\}$ is an agreement forest for $(A', B)$, because for all $i > 2$:

• $A'|(L_1 \cup L_2)$ does not intersect $A'|L_i$,

• $B|(L_1 \cup L_2) = B|L_1 \cup p_B \cup B|L_2$ does not intersect $B|L_i$,

• and $A'/(L_1 \cup L_2) = B/(L_1 \cup L_2)$.

Hence $M_r(A', B) \le m - 1$, which (by the inductive hypothesis) means that $d_{rSPR}(A', B) \le m - 2$
and therefore $d_{rSPR}(A, B) \le m - 1$ as required. □

The rSPR-distance between a pair of trees is often used as a measure of the discrepancy between
a pair of binary trees. So $M_r(\mathcal{A})$ can be seen as a generalisation of this notion from pairs of trees to
arbitrarily large sets of trees.

Example. Let $A$ and $B$ be the two caterpillar trees in Figure 4, which differ only by a single SPR
changing the location of the root. The other leaves in $A$ are ordered in the usual $(1, 2, 3, \ldots, n)$ order
from the root, while in $B$ they are in the reverse order. We will show here that $M_r(A, B) \ge \frac{n-1}{2}$. By
Figure 7. The construction in Lemma 2.2





Figure 8. The construction in Lemma [Ql for n = 8. 

the rooted agreement forest lemma (Lemma 2.2), this means $d_{rSPR}(A, B) \ge \frac{n-3}{2}$ and so

$$D_{rSPR}(n) \ge \frac{n-3}{2}.$$

To show $M_r(A, B) \ge \frac{n-1}{2}$, notice that $A/\{0, a, b, c\}$ is never isomorphic to $B/\{0, a, b, c\}$ because the
leaf nearest to 0 in $A/\{0, a, b, c\}$ is the smallest of $\{a, b, c\}$ while the leaf nearest to 0 in $B/\{0, a, b, c\}$
is the largest. Therefore if $\{L_1, L_2, \ldots, L_m\}$ is a rooted agreement forest for $\{A, B\}$, with $0 \in L_1$, then
$|L_1| \le 3$ and $|L_i| \le 2$ for all $i > 1$. So $n \le 3 + 2(m - 1)$ and therefore $m \ge \frac{n-1}{2}$.

The next two lemmas provide an explicit construction to find an agreement forest for arbitrary
$\mathcal{A} \subseteq \mathcal{B}(X)$.

Lemma 2.3. Let $|X| = n + 1$, let $T \in \mathcal{B}(X)$ be a binary tree with a root and let $a \in (1, 2n]$ be a real
number. It is possible to remove a single edge from $T$, so that the number of leaves in the component
not containing the root is in the interval $[\frac{a}{2}, a)$.

Proof. Wlog let $0 \in X$ be the root. For each vertex $x \in T$ let $\pi(x)$ be the set of leaves $y$ such that $x$
is on the path between 0 and $y$. For every non-leaf vertex $p$,

$\pi(p) = \pi(c_1) \cup \pi(c_2)$,

where $c_1, c_2 \in \Gamma(p)$ are the neighbours of $p$ which are not on the path from $p$ to the root. Moreover,
if $u$ is a leaf then $|\pi(u)| = 1$ (unless $u = 0$, in which case $\pi(u) = X$). Therefore, for each tree-node $v_i$
there exists a neighbour $v_{i+1}$ such that

$$|\pi(v_i)| > |\pi(v_{i+1})| \ge \tfrac{1}{2}|\pi(v_i)|. \qquad (2.1)$$
Figure 9. A subdivision of a tree into 3 disjoint subtrees, each with 3 leaves.


Now construct a path $(v_0, v_1, \ldots, v_m)$, such that $v_0$ is the root, $v_1$ is the neighbour of the root and
equation (2.1) holds for each $i = 1, 2, \ldots, m - 1$. This sequence terminates when $v_m$ is a leaf. So the
sequence decreases from

$|\pi(v_1)| = n \ge \frac{a}{2}$ to $|\pi(v_m)| = 1 < a$,

and never decreases by a factor less than $\frac{1}{2}$ in a single step. Hence there must be some $1 \le j \le m$
such that

$\frac{a}{2} \le |\pi(v_j)| < a.$

If edge $v_j v_{j-1}$ is deleted, then $\pi(v_j)$ is precisely the set of leaves in the component containing $v_j$. □
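
The walk used in this proof can be sketched in a few lines. The representation is an assumption made for illustration: the part of $T$ hanging below the root leaf 0 is written as nested pairs, with any non-tuple value standing for a leaf, and `find_cut` is a hypothetical name. Starting at $v_1$, the walk steps to the child with more leaves until fewer than $a$ leaves lie below the current vertex.

```python
def leaf_labels(t):
    """Leaf labels of a nested-pair tree; a non-tuple value is a single leaf."""
    if not isinstance(t, tuple):
        return [t]
    return leaf_labels(t[0]) + leaf_labels(t[1])


def find_cut(t, a):
    """Return the subtree below v_j, whose number of leaves lies in [a/2, a).

    t is the subtree hanging below the root leaf 0 (so it has n leaves),
    and a must satisfy 1 < a <= 2n.
    """
    n = len(leaf_labels(t))
    if not 1 < a <= 2 * n:
        raise ValueError("a must lie in (1, 2n]")
    while len(leaf_labels(t)) >= a:
        # Step to the heavier child; it keeps at least half of the leaves.
        t = max(t, key=lambda child: len(leaf_labels(child)))
    return t


if __name__ == "__main__":
    balanced = (((1, 2), (3, 4)), ((5, 6), (7, 8)))
    print(leaf_labels(find_cut(balanced, 3)))
```

Deleting the edge above the returned subtree gives the component promised by Lemma 2.3. Recomputing leaf counts at every step keeps the sketch short at the cost of efficiency.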


Lemma 2.4. For $|X| = n + 1 \ge 2$, let $T \in \mathcal{B}(X)$ and let $a$ be a real number greater than 1. It is possible
to divide $T$ into disjoint subtrees such that each leaf lies in a subtree, each subtree contains less than
$a$ leaves and at most one subtree contains less than $\frac{a}{2}$ leaves.

Proof (by induction). The case $a > n + 1$ is trivial. If $n + 1 \ge a$, then we can use Lemma 2.3 iteratively
to prune off subtrees with less than $a$ (but at least $\frac{a}{2}$) leaves until the remaining tree has less than $a$
leaves left (this remainder might have less than $\frac{a}{2}$ leaves). □
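
The iterative pruning in this proof can be sketched as follows, under the same illustrative assumptions as before: the tree below the root leaf 0 is given as nested pairs, and the names `cut_once` and `split_into_parts` are our own. Each part is returned as its list of leaf labels; the last part is the remainder containing the root leaf 0.

```python
def leaf_labels(t):
    """Leaf labels of a nested-pair tree; a non-tuple value is a single leaf."""
    if not isinstance(t, tuple):
        return [t]
    return leaf_labels(t[0]) + leaf_labels(t[1])


def cut_once(t, a):
    """Prune a subtree with at least a/2 and fewer than a leaves.

    Assumes t has at least a leaves. Returns (rest_of_tree, removed_labels).
    """
    left, right = t
    if len(leaf_labels(left)) >= len(leaf_labels(right)):
        heavy, light = left, right
    else:
        heavy, light = right, left
    if len(leaf_labels(heavy)) < a:
        # Removing heavy leaves a degree-2 vertex, which is suppressed.
        return light, leaf_labels(heavy)
    rest, removed = cut_once(heavy, a)
    return (rest, light), removed


def split_into_parts(t, a):
    """Divide the tree (t together with root leaf 0) into parts as in Lemma 2.4."""
    if a <= 1:
        raise ValueError("a must exceed 1")
    parts = []
    while t is not None and len(leaf_labels(t)) + 1 >= a:  # +1 counts leaf 0
        if len(leaf_labels(t)) < a:
            parts.append(leaf_labels(t))  # cut the edge at the root leaf
            t = None
        else:
            t, removed = cut_once(t, a)
            parts.append(removed)
    parts.append([0] + (leaf_labels(t) if t is not None else []))
    return parts


if __name__ == "__main__":
    print(split_into_parts((((1, 2), (3, 4)), ((5, 6), (7, 8))), 4))
```

Every pruned part has at least $\frac{a}{2}$ leaves, so only the final remainder can be small, matching the statement of the lemma.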


Lemma 2.5. For $n \ge k \ge 2$, and any collection $\mathcal{A} = \{A_1, \ldots, A_k\} \subseteq \mathcal{B}(X)$,

$$M_r(\mathcal{A}) \le n - \frac{1}{2(k+1)} \left(\frac{n+1}{k+1}\right)^{1/k} + 1.$$


Proof (by construction). Divide each $A_i$ into $t = \left\lfloor \left(\frac{n+1}{k+1}\right)^{1/k} \right\rfloor$ (possibly empty) disjoint connected parts,

$A_i = A_{i,1} \cup A_{i,2} \cup \cdots \cup A_{i,t}$,

such that each part contains less than $\frac{2(n+1)}{t}$ leaves. This is possible by letting $a = \frac{2(n+1)}{t}$ in Lemma
2.4. Now, let us define a token to be any map $\tau : [k] \to [t]$. For any token $\tau$, define


$$I(\tau) := \bigcap_{i=1}^{k} L(A_{i,\tau(i)}) \quad \text{and} \quad U(\tau) := \bigcup_{i=1}^{k} L(A_{i,\tau(i)}).$$


We say that $\tau$ is good if $|I(\tau)| \ge 2$, and we say a pair $\tau, \tau'$ is compatible if $\tau(i) \ne \tau'(i)$ for all $i \in [k]$. Now if $T$
is a set of pairwise compatible, good tokens, then we can construct an agreement forest of $n - |T| + 1$
parts in the following manner:

• for each $\tau \in T$, construct a part of size 2 using labels in $I(\tau)$,

• all other labels are put in their own part of size 1.
Figure 10. Super pairs lie in different subtrees $A_i$.


The paths between labels in each part of this partition will be disjoint because the tokens are compatible.
Hence this is a valid agreement forest. Now let $T$ be any maximal set of compatible good tokens
(i.e. every good token that is not in $T$ is incompatible with a token in $T$). Then for each label $x \in X$
either $x \in I(\sigma)$ for some non-good token $\sigma$, or $x \in U(\tau)$ for some $\tau \in T$. So let us count:


• there are $n + 1$ labels in total,

• at most $t^k$ non-good tokens, each with at most 1 label in $I(\sigma)$, and

• less than $ka$ labels in $U(\tau)$ for each token $\tau \in T$.


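Therefore $t^k + ka|T| > n + 1$. Substituting $t = \left\lfloor \left(\frac{n+1}{k+1}\right)^{1/k} \right\rfloor$ and $a = \frac{2(n+1)}{t}$, we can compute

$$|T| > \frac{n + 1 - t^k}{ka} \ge \frac{t}{2(k+1)}.$$

Since the above inequality is strict and $t$, $k$ are both integers, we can tighten the bound to:

$$|T| \ge \frac{t + 1}{2(k+1)} \ge \frac{1}{2(k+1)} \sqrt[k]{\frac{n+1}{k+1}}$$

as required.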


□ 
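
The token construction in the proof above can be sketched as follows. The input format is an assumption made for illustration: `part_of[i][x]` is the index of the part of $A_i$ that contains label $x$. The sketch picks a maximal set of pairwise compatible good tokens greedily and returns the resulting partition of labels; it does not itself verify the agreement-forest conditions on any tree.

```python
from collections import defaultdict


def token_forest(part_of):
    """Build the partition from the proof of Lemma 2.5.

    part_of: list of k dicts, one per tree, mapping each label to a part index.
    """
    labels = sorted(part_of[0])
    # The token of a label records, for each tree, which part contains it.
    by_token = defaultdict(list)
    for x in labels:
        by_token[tuple(p[x] for p in part_of)].append(x)

    chosen = []  # greedily chosen, pairwise compatible, good tokens
    for token, members in by_token.items():
        good = len(members) >= 2
        compatible = all(
            all(s != t for s, t in zip(token, other)) for other in chosen
        )
        if good and compatible:
            chosen.append(token)

    pairs = [set(by_token[token][:2]) for token in chosen]
    used = set().union(*pairs) if pairs else set()
    singles = [{x} for x in labels if x not in used]
    return pairs + singles


if __name__ == "__main__":
    tree_1 = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
    tree_2 = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0}
    print(token_forest([tree_1, tree_2]))
```

With $|T|$ chosen tokens the partition has $(n + 1) - |T|$ parts, which is the quantity $n - |T| + 1$ used in the proof.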


Setting $k = 2$ in Lemma 2.5 and then applying the rooted agreement forest lemma (Lemma 2.2) gives

$$D_{rSPR}(n) \le n - \frac{1}{6}\sqrt{\frac{n+1}{3}},$$

which yields a suitable upper bound in Theorem 1.10.


3. Upper Bound for the Expectation 

The upper bound in Theorem 1.11 is established in this section. In fact Lemma 3.1 is stronger because
the underlying unlabelled trees are not chosen randomly.

Lemma 3.1. Let $k \ge 2$ be a fixed integer. Let $X = [n] \cup \{0\}$ and let $\mathcal{T}$ be an arbitrary set of $k$ trees in
$\mathcal{B}(X)$. If $\mathcal{A}$ is obtained by performing a permutation of the leaf-labels of each $T \in \mathcal{T}$ (the permutations
are chosen independently and uniformly at random), then

$$\mathbb{E}[M_r(\mathcal{A})] \le n - \Omega\left(n^{2/(k+1)}\right)$$

as $n \to \infty$.
Proof. Set integers $b = \Theta\left(n^{(k-1)/(k+1)}\right)$ and $s = \Theta\left(n^{2/(k+1)}\right)$. Now for each $T \in \mathcal{T}$, let $T_1, T_2, \ldots, T_s$ be disjoint subtrees
of $T$ such that the number of leaves in each $T_i$ is exactly $b$. This is possible by Lemma 2.3 (set $a = 2b$)
and taking subtrees as necessary. Let $A \in \mathcal{A}$ be the tree obtained by the random labelling of $T$, and let
$A_1, A_2, \ldots, A_s$ be the subtrees of $A$ corresponding to $T_1, T_2, \ldots, T_s$ respectively. Call $\{x_1, \ldots, x_j\} \subseteq X$
a good set if for each $A \in \mathcal{A}$, the leaves labelled $x_1, \ldots, x_j$ all lie in the same $A_i$. We say that a good
pair of distinct labels $\{x_1, x_2\}$ is spoiled if for some $A \in \mathcal{A}$ there is another good pair $\{x_3, x_4\}$ in the
same subtree $A_i$ of $A$ as $x_1$ and $x_2$. We call a pair super if it is good but not spoiled. We have

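$$\mathbb{P}(\{x_1, x_2\} \text{ is good}) = \left(\frac{sb}{n+1} \times \frac{b-1}{n}\right)^{k}$$

because for any $A \in \mathcal{A}$ the probability that $x_1 \in \bigcup_{i=1}^{s} L(A_i)$ is $\frac{sb}{n+1}$ and given that $x_1 \in L(A_i)$ the
probability that $x_2 \in L(A_i)$ is $\frac{b-1}{n}$. Similarly: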


$$\mathbb{P}(\{x_1, x_2, x_3\} \text{ is good}) = \left(\frac{sb}{n+1} \times \frac{b-1}{n} \times \frac{b-2}{n-1}\right)^{k}.$$
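
For a single tree, the two products above can be checked by brute force on a tiny instance. In the sketch below the model is an assumption made for illustration: the leaves are positions $0, \ldots, n$, subtree $i$ (for $i < s$) owns positions $ib, \ldots, ib + b - 1$, and a labelling is a uniformly random assignment of labels to positions. The function names are hypothetical. For $k$ independently labelled trees, the single-tree probability is raised to the power $k$.

```python
from fractions import Fraction
from itertools import permutations


def good_probability_formula(n, s, b, j):
    """sb/(n+1) * (b-1)/n * (b-2)/(n-1) * ... with j factors, for one tree."""
    p = Fraction(s * b, n + 1)
    for i in range(1, j):
        p *= Fraction(b - i, n + 1 - i)
    return p


def good_probability_bruteforce(n, s, b, j):
    """Exact probability that labels 0..j-1 land in one common subtree."""
    hits = total = 0
    for position in permutations(range(n + 1)):  # position[x] = leaf of label x
        total += 1
        spots = position[:j]
        inside = all(p < s * b for p in spots)
        same_block = len({p // b for p in spots}) == 1
        hits += inside and same_block
    return Fraction(hits, total)


if __name__ == "__main__":
    print(good_probability_formula(5, 1, 3, 2), good_probability_bruteforce(5, 1, 3, 2))
```

The enumeration is factorial in $n$, so it is only meant for very small cases.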


Since $\{x_1, x_2, x_3\}$ is good if and only if $\{x_1, x_2\}$ and $\{x_1, x_3\}$ are both good, we can see that

$$\mathbb{P}(\{x_1, x_3\} \text{ good} \mid \{x_1, x_2\} \text{ good}) = \left(\frac{b-2}{n-1}\right)^{k}. \qquad (3.1)$$


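Now fix a good pair $\{x_1, x_2\}$ with $\{x_1, x_2\} \subseteq L(A_i)$ for some $A \in \mathcal{A}$. If $x_3$ and $x_4$ are chosen from
$L(A_i) \setminus \{x_1, x_2\}$, we compute an upper bound on the probability that $\{x_3, x_4\}$ is a good pair. For each
$A' \in \mathcal{A} \setminus \{A\}$, the probability that $x_3 \in \bigcup_{j=1}^{s} L(A'_j)$ is $\frac{sb-2}{n-1}$, and given that $x_3 \in L(A'_j)$, the probability
that $x_4 \in L(A'_j)$ is at most $\frac{b-1}{n-2}$. Therefore

$$\mathbb{P}(\{x_3, x_4\} \text{ good} \mid \{x_1, x_2\} \text{ good}) \le \left(\frac{sb-2}{n-1} \times \frac{b-1}{n-2}\right)^{k-1}. \qquad (3.2)$$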


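Now let $Y$ denote the set of all pairs $\{y_1, y_2\}$ distinct from $\{x_1, x_2\}$, which lie in the same $A_i$ as
$\{x_1, x_2\}$ for some $A \in \mathcal{A}$. Using the union bound, the probability that $\{x_1, x_2\}$ is super is bounded
below by

$$\mathbb{P}(\{x_1, x_2\} \text{ good}) \left(1 - \sum_{\{y_1, y_2\} \in Y} \mathbb{P}(\{y_1, y_2\} \text{ good} \mid \{x_1, x_2\} \text{ good})\right) \qquad (3.3)$$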


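where the sum ranges over all $\{y_1, y_2\} \in Y$. There are at most $k\binom{b-2}{2}$ such pairs with $\{x_1, x_2\} \cap \{y_1, y_2\} = \emptyset$,
and at most $2k(b - 2)$ such pairs with $|\{x_1, x_2\} \cap \{y_1, y_2\}| = 1$. Substituting Equations
(3.1) and (3.2) into (3.3) yields the lower bound

$$\left(\frac{sb}{n+1} \times \frac{b-1}{n}\right)^{k} \left(1 - k\binom{b-2}{2}\left(\frac{sb-2}{n-1} \times \frac{b-1}{n-2}\right)^{k-1} - 2k(b-2)\left(\frac{b-2}{n-1}\right)^{k}\right).$$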
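The values $b = \Theta\left(n^{(k-1)/(k+1)}\right)$ and $s = \Theta\left(n^{2/(k+1)}\right)$ were chosen so that the factors in the above expression
are positive and so this bound is $\Omega\left(n^{-2k/(k+1)}\right)$. Therefore, if $S$ is the number of super pairs, then

$$\mathbb{E}[S] = \binom{n+1}{2}\, \mathbb{P}(\{x_1, x_2\} \text{ super}) = \Omega\left(n^{2/(k+1)}\right).$$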

Now let us partition the labels such that each super pair forms a part of size two (super pairs are
disjoint by definition), and all other labels are in individual parts. This partition is a valid agreement
forest, because if $x$ and $y$ are any super pairs, then for any $A \in \mathcal{A}$ we know that $A|x$ is contained
within $A_i$ and $A|y$ is contained within $A_j$ for some $i \ne j$. Hence $M_r(\mathcal{A}) \le n + 1 - S$ and so

$$\mathbb{E}[M_r(\mathcal{A})] \le n + 1 - \mathbb{E}[S] = n - \Omega\left(n^{2/(k+1)}\right). \qquad □$$
Figure 11. An example of an unrooted agreement forest.

If we set $k = 2$ in Lemma 3.1 and apply the rooted agreement forest lemma (Lemma 2.2) we see
that

$$\mathbb{E}[d_{rSPR}(A, B)] \le n - \Omega(n^{2/3}).$$

If $\mathcal{T} = (A, B)$ is chosen uniformly at random, then a random relabelling of the leaves would not affect
the distribution, so the last inequality gives the upper bound in Theorem 1.11.

4. Lower Bound for the Radius 

The lower bound in Theorem 1.10 is established in Lemma 4.5 at the end of this section. The proofs in
this section use a notion similar to the rooted agreement forest from the previous section; Definition
4.1 and Lemma 4.2 below are analogous to Definition 2.1 and Lemma 2.2 for TBRs instead of rSPRs.

Definition 4.1. Let $\mathcal{A} \subseteq \mathcal{B}(X)$ be a set of $k \ge 2$ trees. An unrooted agreement forest of $\mathcal{A}$ is a partition
$L_1, \ldots, L_m$ of $X$, such that for all $A, B \in \mathcal{A}$:

• the subtrees $A|L_1, A|L_2, \ldots, A|L_m$ are disjoint, and

• $A/L_j = B/L_j$ for $j = 1, \ldots, m$.

Let $M(\mathcal{A})$ be the minimal possible value for $m$; the minimal number of parts in an unrooted agreement
forest of $\mathcal{A}$. When $k = 2$, we write $M(A, B)$ for $M(\{A, B\})$.

Lemma 4.2 (Unrooted Agreement Forest Lemma). If $A$ and $B$ are any two trees in $\mathcal{B}(X)$, then

$M(A, B) = d_{TBR}(A, B) + 1.$

Proof. Let $A = A_0, A_1, \ldots, A_d = B$ be a shortest sequence satisfying $(A_{i-1}, A_i) \in \mathcal{TBR}$ for each $i \in [d]$. If we
perform all the bisections of these TBR moves (and none of the reconnections), then the result will
be an agreement forest of at most $d + 1$ parts. Therefore

$M(A_0, A_d) \le d + 1 = d_{TBR}(A_0, A_d) + 1.$

Now we show $m := M(A, B) \ge d_{TBR}(A, B) + 1$, by induction on $m$. For the base case $m = 1$,
any agreement forest with one component must be $A = B$ itself, hence $d_{TBR}(A, B) = 0$. For the
inductive step $m \ge 2$, it suffices to show that there exists some $A'$ with $(A, A') \in \mathcal{TBR}$, such that
$M(A', B) \le m - 1$. Let $F = \{L_1, \ldots, L_m\}$ be an agreement forest for $(A, B)$. We can construct $A'$
from $F$ explicitly:

Step 1. Choose $e$, the bisection edge, arbitrarily from the edges not in $A|L_i$ for any $i$. We know
that such an edge exists because $m \ge 2$.

The deletion of $e$ forms two non-empty components $A_1$ and $A_2$. Note that each of the parts $A|L_i$ lies
completely within one of $A_1$ or $A_2$.
Figure 12. The construction in Lemma 4.2


Step 2. Now arbitrarily choose a pair of parts $L_x$ and $L_y$, such that $A|L_x$ lies in $A_1$ and $A|L_y$ lies
in $A_2$. Let $p_B$ be the path in $B$ from a vertex in $B|L_x$ to a vertex in $B|L_y$.

Step 3. Let $v_1$ be the last vertex on $p_B$ which lies in a part $B|L_1$ (not necessarily distinct from
$B|L_x$) such that $A|L_1$ lies in $A_1$. Let $v_2$ be the first vertex after $v_1$ in $p_B$ which lies in a part $B|L_2$
(not necessarily distinct from $B|L_y$). By definition, $A|L_2$ must lie in $A_2$ (see Figure 12).

So the path between $v_1$ and $v_2$ has no vertices in any $B|L_i$ except for its endpoints. Now $v_1$
corresponds to the midpoint of an edge in $B/L_1 = A/L_1$, which in turn corresponds to a path $p_1$ in
$A_1$ (or else $|L_1| = 1$ and $L_1$ is a singleton). Similarly $v_2$ corresponds to a path $p_2$ in $A_2$ (or $L_2 = \{v_2\}$).

Step 4. Construct $A'$ to be the reconnection of $A_1$ and $A_2$ by an edge between the midpoint of
an edge in $p_1$ and the midpoint of an edge in $p_2$ (if $L_1$ or $L_2$ is a singleton, then simply use its
vertex as the endpoint of the reconnecting edge).

By construction, $(A, A') \in \mathcal{TBR}$ because we constructed $A'$ by performing a TBR move on $A$. Moreover
$A'/(L_1 \cup L_2) = B/(L_1 \cup L_2)$, so

$F' = \{(L_1 \cup L_2), L_3, \ldots, L_m\}$

is an agreement forest for $(A', B)$, of size $m - 1$. □

We now present a result concerning permutations (Lemma 4.3) and a property of $\mathcal{A}$ that guarantees
large $M(\mathcal{A})$ (Lemma 4.4). These two lemmas are the two crucial steps required for the proof
of the main result of this section (Lemma 4.5).

Lemma 4.3. For any given integers $n$, $b$ and $k \ge 2$ such that $2 \le b \le \sqrt[k]{n+1}$, let $X = \{0, 1, \ldots, n\}$. There
exist $k$ permutations $\{\phi_j : X \to X\}_{j=1}^{k}$ such that for any distinct $x, y \in X$ there exists some $i$ such
that

$$|\phi_i(x) - \phi_i(y)| \ge b^{k-1} - 2b^{k-3}.$$


Proof (by construction). Represent each $x \in X$ by the $k$-tuple $(x_1, x_2, \ldots, x_k)$ of non-negative integers
such that

• $x = x_1 b^{k-1} + x_2 b^{k-2} + \cdots + x_k$, and

• $0 \le x_i < b$ for all $1 < i \le k$.

If $x < b^k$, then this is exactly representing $x$ in base $b$. For larger $x$, we allow the coefficient of $b^{k-1}$
to exceed $b - 1$. Now let $\phi_1$ be the identity permutation, and for each $1 < j \le k$ we define $\phi_j$ so that

$\phi_j(x) > \phi_j(y)$ iff either $x_j < y_j$, or $x_j = y_j$ and $x > y$.

See Figure 13 for an illustration. If $d_j = x_j - y_j$ then for $i \ge 2$ and $\phi_i(x) > \phi_i(y)$ (i.e. $d_i < 0$) we have

$$\phi_i(x) - \phi_i(y) \ge -d_i b^{k-1} + \sum_{j=1}^{i-1} d_j b^{k-j-1} + \sum_{j=i+1}^{k} d_j b^{k-j}. \qquad (4.1)$$

This is an equality when $n + 1 = b^k$ (i.e. when $n + 1$ is a perfect power), and increasing $n$ can only
increase the number of labels $z$ such that $\phi_i(z)$ is between $\phi_i(x)$ and $\phi_i(y)$.

Now we fix distinct $x, y \in X$ and wlog $x_1 \ge y_1$. There are four cases:

1. $|x_i - y_i| > 1$ for some $i$.

2. $x_1 = y_1$.

3. $x_1 > y_1$ and $x_i \ge y_i$ for all $i$.

4. $x_1 > y_1$ and $x_j < y_j$ for some $j \ge 2$.

Figure 13. Permutations $\phi_1$ (left) and $\phi_2$ (right), with $n = 27$, $k = 2$ and $b = 5$.

We treat each of these cases separately.

Case 1. All $z$ such that $z_i = \min(x_i, y_i) + 1$ must have $\phi_i(z)$ between $\phi_i(x)$ and $\phi_i(y)$. There are
at least $b^{k-1}$ such $z$.

Case 2. Wlog $d_i = -1$ and $d_j = 0$ for $j < i$ for some $i \ge 2$. Substituting this into equation (4.1) yields

$$\phi_i(x) - \phi_i(y) \ge b^{k-1} - b^{k-3} - b^{k-4} - \cdots - 1 \ge b^{k-1} - 2b^{k-3}.$$

Case 3. If $d_1 = 1$ and $d_j \ge 0$ for all $j$, then $|\phi_1(x) - \phi_1(y)| = x - y \ge b^{k-1}$.

Case 4. We have $d_1 = 1$ and there is some $i \ge 2$ with $d_i = -1$ and $d_j \ge 0$ for all $j < i$. Substituting
this into equation (4.1) yields

$$\phi_i(x) - \phi_i(y) \ge b^{k-1} + b^{k-2} - b^{k-3} - \cdots - 1 \ge b^{k-1}.$$



□ 
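
The permutations of Lemma 4.3 can be built and checked numerically for small parameters. The sketch below follows the construction in the proof (digits in base $b$, with the leading coefficient allowed to exceed $b - 1$); the function names are our own, and the exhaustive check over all pairs is only practical for small $n$.

```python
from itertools import combinations


def digits(x, b, k):
    """k-tuple (x_1, ..., x_k) with x = sum of x_i * b**(k-i) and x_i < b for i > 1."""
    low = []
    for _ in range(k - 1):
        x, r = divmod(x, b)
        low.append(r)
    low.append(x)  # leading coefficient, may exceed b - 1
    return low[::-1]


def build_permutations(n, b, k):
    """Return [phi_1, ..., phi_k] as dicts mapping each label to its position."""
    labels = range(n + 1)
    phis = [{x: x for x in labels}]  # phi_1 is the identity
    for j in range(1, k):  # 0-based digit index j corresponds to phi_{j+1}
        # Smaller digit means larger phi; ties are broken by increasing x.
        order = sorted(labels, key=lambda x: (-digits(x, b, k)[j], x))
        phis.append({x: pos for pos, x in enumerate(order)})
    return phis


def worst_separation(n, b, k):
    """Minimum over pairs x != y of the best separation max_i |phi_i(x) - phi_i(y)|."""
    phis = build_permutations(n, b, k)
    return min(
        max(abs(phi[x] - phi[y]) for phi in phis)
        for x, y in combinations(range(n + 1), 2)
    )


if __name__ == "__main__":
    n, b, k = 27, 5, 2  # the parameters of Figure 13
    print(worst_separation(n, b, k), b ** (k - 1) - 2 * b ** (k - 3))
```

For the parameters of Figure 13 the bound $b^{k-1} - 2b^{k-3}$ equals $4.6$, so every pair of labels should be separated by at least 5 positions in one of the two permutations.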


Lemma 4.4. Let $n, k, t$ be positive integers, let $X = \{0, 1, \ldots, n\}$ and let $\mathcal{A} \subseteq \mathcal{B}(X)$ be a set of $k$ trees.
For each $A \in \mathcal{A}$ let $A = A_1 \cup A_2 \cup A_3 \cup \cdots$ be a partition of the vertices of $A$ into at most $t$ connected
parts. If for all but $z$ pairs of distinct labels $x, y \in X$ there is at least one $A \in \mathcal{A}$ such that the leaves
labelled with $x$ and $y$ are in different parts of $A$, then

$M(\mathcal{A}) \ge n - kt + k - z + 1.$

Proof. Let $\{L_1, L_2, \ldots, L_m\}$ be an agreement forest of $\mathcal{A}$ and let $F = A/L_1 \cup A/L_2 \cup \cdots \cup A/L_m$ be
a forest with leaves labelled by $X$ for some $A \in \mathcal{A}$ (by Definition 4.1 it doesn't matter which $A \in \mathcal{A}$
we choose). Each tree $A \in \mathcal{A}$ can be divided into the forest $A = A_1 \cup A_2 \cup A_3 \cup \cdots$ by deleting at
most $t - 1$ edges. Let these edges be called hot edges. There is a total of at most $k(t - 1)$ hot edges
(at most $t - 1$ in each $A \in \mathcal{A}$). An edge in $A$ might not correspond to any edge in $F$; moreover there
may be several edges that correspond to the same edge in $F$. In any case, each hot edge corresponds
to at most one edge in $F$. Therefore if we delete all the edges in $F$ corresponding to hot edges, then
we delete at most $k(t - 1)$ edges.

The result of the deletion of these edges would be a forest of at least $|X| - z$ components because $z$
is an upper bound on the number of pairs of leaves in the same component. Therefore $F$ must have
had $m \ge n + 1 - z - k(t - 1)$ components. □
Figure 14. Edges $e_1, e_2$ in $A$ correspond to the same edge in $A/\{4, 5, 6\}$.


Lemma 4.5. Let $k \ge 2$, let $n \ge (3k)^k$ and let $X = \{0, 1, \ldots, n\}$. If $\mathcal{T} = \{T_1, T_2, \ldots, T_k\}$ is a collection
of binary trees each with $n + 1$ leaves, then there exists $\mathcal{A} = \{A_1, A_2, \ldots, A_k\} \subseteq \mathcal{B}(X)$ such that $A_i$ is
a labelling of the leaves of $T_i$, and

$$M(\mathcal{A}) \ge n - 3k\sqrt[k]{n} + 1.$$


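Proof. First set

$$b = \left\lfloor \sqrt[k]{n+1} \right\rfloor, \quad t = \left\lceil \frac{2n}{b^{k-1} - 2b^{k-3}} \right\rceil \quad \text{and} \quad a = \frac{2n}{t} \le b^{k-1} - 2b^{k-3}.$$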
Then divide each $T_i$ into at most $t$ connected parts, $T_i = T_{i1} \cup T_{i2} \cup T_{i3} \cup \cdots$ (using Lemma 2.4) so
each part contains less than $a$ leaves. Now it suffices to label the leaves of each $T_i$ such that for any
pair of distinct labels $x, y \in X$, there is some $i$ such that $x$ and $y$ are in different parts of $T_i$, and then
apply Lemma 4.4 (with $z = 0$).

To achieve this: First (by Lemma 4.3) find $k$ permutations of $X$ such that for any two distinct labels
$x, y \in X$ there is some $1 \le i \le k$ such that $|\phi_i(x) - \phi_i(y)| \ge a$. Then assign labels to the leaves of $T_i$
in the order given by $\phi_i$, firstly to all the leaves of $T_{i1}$, then $T_{i2}$, then $T_{i3}$, and so on. For any distinct
$x$ and $y$, consider a permutation $\phi_i$ in which $|\phi_i(x) - \phi_i(y)| \ge a$. If some $T_{ij}$ contained both $x$ and $y$,
then it would have to also contain the labels

$\{z : \phi_i(x) > \phi_i(z) > \phi_i(y) \text{ or } \phi_i(x) < \phi_i(z) < \phi_i(y)\}$,

but this is impossible since each $T_{ij}$ has size less than $a$. Thus by Lemma 4.4

$M(\mathcal{A}) \ge n - k(t - 1) + 1.$
For constant $k$, one can easily observe that $t \sim 2\sqrt[k]{n}$ as $n \to \infty$. For an explicit bound (using
$n \ge (3k)^k$), we can show $t \le 3\sqrt[k]{n} + 1$, in the following manner:

$$t \le \frac{2n}{b^{k-3}(b^2 - 2)} + 1 \le 3n^{1/k} + 1.$$

□ 


If $\mathcal{A}' = \{A'_1, \ldots, A'_k\}$ is obtained from $\mathcal{A}$ by a permutation, $\pi$, of the labels (i.e. the leaf labelled
$x$ in $A_i$ is labelled $\pi(x)$ in $A'_i$ for all $x \in X$ and all $i \in [k]$) then $M(\mathcal{A}') = M(\mathcal{A})$. Therefore in the
above lemma, we could have constructed $\mathcal{A}$ so that one of the trees has a particular labelling, say
$A_1$. Now if we set $k = 2$ and apply the unrooted agreement forest lemma (Lemma 4.2) then

$R_{TBR}(n) \ge n - 6\sqrt{n}$,

for all integers $n \ge 1$.² This is a suitable lower bound for Theorem 1.10.


5. Lower Bound for the Expectation 


Lemma 5.1. Let $k \ge 2$, let $n$ be a positive integer, let $X = \{0, 1, \ldots, n\}$, and let $T_1, T_2, \ldots, T_k$ be
binary trees each with $n + 1$ leaves. If $\mathcal{A} = \{A_1, A_2, \ldots, A_k\} \subseteq \mathcal{B}(X)$ is chosen uniformly at random
such that each $A_i$ is a labelling of the leaves of $T_i$, then

$$\mathbb{E}[M(\mathcal{A})] \ge n - (2k + 1)(n+1)^{2/(k+1)} + 1.$$


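Proof. First set

$$a = (n+1)^{(k-1)/(k+1)} \quad \text{and} \quad t = \left\lceil 2(n+1)^{2/(k+1)} \right\rceil.$$

Then, using Lemma 2.4, divide each $A_i$ into $t$ (possibly empty) connected parts, so that each part has
less than $a$ leaves: $A_i = A_{i1} \cup \cdots \cup A_{it}$. For distinct $x, y \in X$, notice that

$$\mathbb{P}(y \in L(A_{ij}) \mid x \in L(A_{ij})) = \frac{|L(A_{ij})| - 1}{n} < \frac{a - 1}{n}.$$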

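Now let $Z$ be the set of pairs of distinct labels $\{x, y\}$ such that for each $i$, $x$ and $y$ are in the same
part $A_{ij}$. Explicitly

$Z := \{\{x, y\} : x \ne y \text{ and } \forall i \ \exists j \text{ s.t. } x, y \in L(A_{ij})\}.$

Since the labellings are chosen independently, $\mathbb{P}(\{x, y\} \in Z) < \left(\frac{a-1}{n}\right)^{k}$ for all $x, y \in X$. Now let
$z := |Z|$. By the linearity of expectation

$$\mathbb{E}[z] = \binom{n+1}{2} \mathbb{P}(\{x, y\} \in Z) \le n^2 \left(\frac{a-1}{n}\right)^{k} \le n^{2/(k+1)}.$$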


²For $n < 36$ this bound is trivial since $R_{TBR}(n) \ge 0 > n - 6\sqrt{n}$.

Since $t - 1 \le 2(n+1)^{2/(k+1)}$ and $\mathbb{E}[z] \le n^{2/(k+1)}$ we can apply Lemma 4.4 to get

$\mathbb{E}[M(\mathcal{A})] \ge n - k(t - 1) - \mathbb{E}[z] + 1$

$\ge n - (2k + 1)(n+1)^{2/(k+1)} + 1.$ □

Setting $k = 2$ in Lemma 5.1 and choosing $T_1$ and $T_2$ uniformly and independently (so that
$\mathcal{A} = \{A_1, A_2\}$ is distributed uniformly) gives $\mathbb{E}[M(A_1, A_2)] \ge n - 5(n+1)^{2/3} + 1$. Applying the unrooted
agreement forest lemma (Lemma 4.2) gives

$\mathbb{E}[d_{TBR}(A, B)] \ge n - 5(n+1)^{2/3}$,

which is a suitable lower bound for Theorem 1.11.

Acknowledgement: We would like to thank Charles Semple for helpful comments. 